Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: halt-height behavior is not deterministic #16639

Merged
merged 6 commits into from
Jun 27, 2023

Conversation

yihuang
Copy link
Collaborator

@yihuang yihuang commented Jun 21, 2023

Description

Closes: #16638

Solution:

  • make sure in state machine that we don't execute block beyond halt-height

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

Solution:
- make sure in state machine that we don't execute block beyond halt-height
@yihuang yihuang requested a review from a team as a code owner June 21, 2023 22:38
@yihuang yihuang changed the title Problem: halt-height behavior is not deterministic fix: halt-height behavior is not deterministic Jun 21, 2023
baseapp/abci.go Outdated
Comment on lines 648 to 654
func (app *BaseApp) FinalizeBlock(req *abci.RequestFinalizeBlock) (*abci.ResponseFinalizeBlock, error) {
var events []abci.Event

// don't execute blocks beyond the halt height
var halt bool
switch {
case app.haltHeight > 0 && uint64(req.Height) > app.haltHeight:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change potentially affects state.

Call sequence:

(*github.com/cosmos/cosmos-sdk/baseapp.BaseApp).FinalizeBlock (baseapp/abci.go:648)

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@testinginprod testinginprod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I would like a test case to be added in order to make sure that in case of refactors of the baseapp code we don't miss behaviours.

@julienrbrt julienrbrt added the backport/v0.50.x PR scheduled for inclusion in the v0.50's next stable release label Jun 22, 2023
Copy link
Contributor

@alexanderbez alexanderbez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow. If I say, for example, halt-height == 1000, then when Commit is called indicating block completion, the process will exit after 1000. How is it non-deterministic? Why is the logic duplicated in FinalizeBlock?

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 22, 2023

I don't quite follow. If I say, for example, halt-height == 1000, then when Commit is called indicating block completion, the process will exit after 1000. How is it non-deterministic? Why is the logic duplicated in FinalizeBlock?

because sending signal to self process is asynchronous, the current process can run for a while before the signal is handled and stop the node, I seems to observe that when I use --halt-height with local blocks that it don't stop at the expected height exactly.

@alexanderbez
Copy link
Contributor

alexanderbez commented Jun 22, 2023

Mhhh, so os.Exit() is async?? Or do you mean that os.Exit() actually sends a signal, which is caught in the server package, which case that is obviously async, yes.

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 22, 2023

Mhhh, so os.Exit() is async?? Or do you mean that os.Exit() actually sends a signal, which is caught in the server package, which case that is obviously async, yes.

I think it only calls os.Exit if signal sending fails

@alexanderbez
Copy link
Contributor

I'm sorry, I don't follow. When the halt-height is reached, os.Exit() will be called:

https://github.com/cosmos/cosmos-sdk/blob/v0.47.3/baseapp/abci.go#L479-L485

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 23, 2023

I'm sorry, I don't follow. When the halt-height is reached, os.Exit() will be called:

https://github.com/cosmos/cosmos-sdk/blob/v0.47.3/baseapp/abci.go#L479-L485

You can check the internal of halt method, it don't call os.Exit unconditionally

@alexanderbez
Copy link
Contributor

You can check the internal of halt method, it don't call os.Exit unconditionally

I don't understand what this means. os.Exit() is called when the halt-height is reached, immediately exiting the process AFAIU.

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 24, 2023

You can check the internal of halt method, it don't call os.Exit unconditionally

I don't understand what this means. os.Exit() is called when the halt-height is reached, immediately exiting the process AFAIU.


func (app *BaseApp) halt() {
	app.logger.Info("halting node per configuration", "height", app.haltHeight, "time", app.haltTime)

	p, err := os.FindProcess(os.Getpid())
	if err == nil {
		// attempt cascading signals in case SIGINT fails (os dependent)
		sigIntErr := p.Signal(syscall.SIGINT)
		sigTermErr := p.Signal(syscall.SIGTERM)

		if sigIntErr == nil || sigTermErr == nil {
			return
		}
	}

	// Resort to exiting immediately if the process could not be found or killed
	// via SIGINT/SIGTERM signals.
	app.logger.Info("failed to send SIGINT/SIGTERM; exiting...")
	os.Exit(0)
}

os.Exit() is not called if signal sending is successful, it's a fallback if signal sending failed, it's not ideal because the shutdown would not be graceful.

Copy link
Collaborator

@odeke-em odeke-em left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and thank you @yihuang! Would be nice though to have a test to lock-in this behavior to prevent future regressions. It is critical code yet tomorrow I can sneakishly delete it and no tests will fail.

@alexanderbez
Copy link
Contributor

Wow I didn't see the PID logic -- completely forgot about that. I believe I originally wrote this code, but I can't recall why not just resort to os.Exit() always?

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 26, 2023

Wow I didn't see the PID logic -- completely forgot about that. I believe I originally wrote this code, but I can't recall why not just resort to os.Exit() always?

maybe for graceful shutdown, although right now we don't have critical stuff to cleanup, but for async commit (memiavl is doing that already), it'll be helpful to wait for it to complete before shutdown.
And I'm not sure if the Commit event of the last block should return successfully for cometbft to finish some post logic.

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 26, 2023

LGTM and thank you @yihuang! Would be nice though to have a test to lock-in this behavior to prevent future regressions. It is critical code yet tomorrow I can sneakishly delete it and no tests will fail.

no problem, I just feel the solution is not ideal yet, first of all, there are code duplication between Commit and BeginBlocker, and x/upgrade has done something similar just by panic, so I think maybe just do the panic in BeginBlocker would be good enough, aka. remove the signal sending and os.Exit stuff from Commit? WDYT @alexanderbez

@alexanderbez
Copy link
Contributor

alexanderbez commented Jun 27, 2023

LGTM and thank you @yihuang! Would be nice though to have a test to lock-in this behavior to prevent future regressions. It is critical code yet tomorrow I can sneakishly delete it and no tests will fail.

no problem, I just feel the solution is not ideal yet, first of all, there are code duplication between Commit and BeginBlocker, and x/upgrade has done something similar just by panic, so I think maybe just do the panic in BeginBlocker would be good enough, aka. remove the signal sending and os.Exit stuff from Commit? WDYT @alexanderbez

Yes, a panic in BeginBlock is perfectly reasonable IMO. Just make sure it's triggered at HaltHeight + 1

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 27, 2023

Yes, a panic in BeginBlock is perfectly reasonable IMO. Just make sure it's triggered at HaltHeight + 1

done, I think the equivalence of panic in latest abci is just simply returning errors.

Comment on lines 649 to 655

if err := app.checkHalt(req.Height, req.Time); err != nil {
return nil, err
}

if err := app.validateFinalizeBlockHeight(req); err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change potentially affects state.

Call sequence:

(*github.com/cosmos/cosmos-sdk/baseapp.BaseApp).FinalizeBlock (baseapp/abci.go:649)

@alexanderbez alexanderbez added this pull request to the merge queue Jun 27, 2023
Merged via the queue into cosmos:main with commit c0e9e8c Jun 27, 2023
mergify bot pushed a commit that referenced this pull request Jun 27, 2023
Co-authored-by: Aleksandr Bezobchuk <alexanderbez@users.noreply.github.com>
(cherry picked from commit c0e9e8c)

# Conflicts:
#	CHANGELOG.md
julienrbrt added a commit that referenced this pull request Jun 27, 2023
)

Co-authored-by: yihuang <huang@crypto.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
yihuang added a commit to yihuang/chain-main that referenced this pull request Jul 4, 2023
github-merge-queue bot pushed a commit to crypto-org-chain/chain-main that referenced this pull request Jul 5, 2023
* Problem: halt-height is not deterministic

Solution:
- port the fix from sdk: cosmos/cosmos-sdk#16639

* Update CHANGELOG.md

Signed-off-by: yihuang <huang@crypto.com>

---------

Signed-off-by: yihuang <huang@crypto.com>
mhofman pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Sep 13, 2023
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
mhofman pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Sep 19, 2023
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
mhofman added a commit to agoric-labs/cosmos-sdk that referenced this pull request Sep 19, 2023
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
michaelfig pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Nov 9, 2023
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
michaelfig pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Dec 3, 2023
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
michaelfig pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Jul 5, 2024
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
JeancarloBarrios pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Sep 28, 2024
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
JeancarloBarrios pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Sep 28, 2024
Port of cosmos#16639

Co-authored-by: yihuang <huang@crypto.com>
@faddat faddat mentioned this pull request Nov 8, 2024
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/v0.50.x PR scheduled for inclusion in the v0.50's next stable release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: halt-height is not deterministic
7 participants